Memory-hierarchy optimal matrix multiplication programs
Author
Abstract
In many applications that work on huge data sets, from areas like bioinformatics, data mining, network analysis, optimization, and simulation, the computation time is a major concern. To shorten this time, we have to take into account the many aspects that determine the running time of a program on a modern computer. One of these aspects is the range of different types of memory, from fast but small to huge but slow: the registers on the CPU, various kinds of caches, main memory, and disk form the levels of the memory hierarchy. Data transfer between these levels happens by means of so-called I/O operations, which can be very slow and should be minimized to achieve fast programs. The core computation of many data-intensive applications can be modeled with sparse matrices, so that general-purpose high-performance software libraries for abstract operations can be used. One of the basic such operations is the multiplication of a huge sparse matrix with a vector. It has been recognized that this is one of the operations where modern computers operate significantly below their peak CPU performance, indicating that the data transfer in the memory hierarchy is indeed the bottleneck. In very recent work, we showed a lower bound on the number of data transfers that are necessary to compute the product of an arbitrary sparse matrix with a vector, and gave a sorting-based algorithm that is asymptotically optimal. Fortunately, in many applications the sparse matrices do have a structure that can be exploited to perform the multiplication with fewer I/Os. The focus of this project is to (automatically) analyze the structure of a huge sparse matrix with respect to the amount of data transfer that is required to multiply with this matrix. In other words, for a given sparse matrix A we are interested in a program that transforms a vector x into the product Ax with the fewest possible I/O operations.
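As a concrete illustration of the operation under study, here is a minimal sketch of the product y = Ax for a sparse matrix stored in compressed sparse row (CSR) format; the function and parameter names are illustrative and not taken from the project. The irregular reads of x through the column indices are exactly the memory traffic that the I/O analysis above targets.

    /* Minimal CSR sparse matrix-vector product y = A*x (illustrative sketch). */
    #include <stddef.h>

    void spmv_csr(size_t n,             /* number of rows of A */
                  const size_t *rowptr, /* nonzeros of row i: rowptr[i]..rowptr[i+1]-1 */
                  const size_t *col,    /* column index of each nonzero */
                  const double *val,    /* value of each nonzero */
                  const double *x,      /* input vector */
                  double *y)            /* output vector, length n */
    {
        for (size_t i = 0; i < n; i++) {
            double sum = 0.0;
            for (size_t j = rowptr[i]; j < rowptr[i + 1]; j++)
                sum += val[j] * x[col[j]]; /* irregular access to x: the expensive I/O */
            y[i] = sum;
        }
    }

For an unstructured matrix, the accesses x[col[j]] jump unpredictably through memory, which is why the number of I/O operations, rather than the number of multiplications, dominates the running time.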
Similar resources
An Experimental Comparison of Cache-oblivious and Cache-aware Programs
Cache-oblivious algorithms have been advanced as a way of circumventing some of the difficulties of optimizing applications to take advantage of the memory hierarchy of modern microprocessors. These algorithms are based on the divide-and-conquer paradigm – each division step creates sub-problems of smaller size, and when the working set of a sub-problem fits in some level of the memory hierarch...
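To make the divide-and-conquer idea concrete, the following is a hypothetical sketch of a cache-oblivious recursive matrix multiplication C += A*B for square row-major matrices whose size n is a power of two (leading dimension ld, initial call with ld = n and C zeroed); the names and simplifying assumptions are ours, not the paper's. Note that no cache parameter appears in the code: once a sub-problem's working set fits in some cache level, all of its accesses are served from that level.

    /* Illustrative cache-oblivious matrix multiplication: C += A*B,
       n a power of two, matrices row-major with leading dimension ld. */
    #include <stddef.h>

    static void matmul_rec(const double *A, const double *B, double *C,
                           size_t n, size_t ld)
    {
        if (n == 1) {            /* base case: single scalar update */
            C[0] += A[0] * B[0];
            return;
        }
        size_t h = n / 2;        /* split A, B, C into four quadrants each */
        const double *A11 = A,          *A12 = A + h,
                     *A21 = A + h * ld, *A22 = A + h * ld + h;
        const double *B11 = B,          *B12 = B + h,
                     *B21 = B + h * ld, *B22 = B + h * ld + h;
        double *C11 = C,                *C12 = C + h,
               *C21 = C + h * ld,       *C22 = C + h * ld + h;
        /* each quadrant of C is the sum of two quadrant products */
        matmul_rec(A11, B11, C11, h, ld); matmul_rec(A12, B21, C11, h, ld);
        matmul_rec(A11, B12, C12, h, ld); matmul_rec(A12, B22, C12, h, ld);
        matmul_rec(A21, B11, C21, h, ld); matmul_rec(A22, B21, C21, h, ld);
        matmul_rec(A21, B12, C22, h, ld); matmul_rec(A22, B22, C22, h, ld);
    }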
Communication Lower Bounds and Optimal Algorithms for Programs That Reference Arrays — Part 1 (Revised)
Communication, i.e., moving data, between levels of a memory hierarchy or between parallel processors on a network, can greatly dominate the cost of computation, so algorithms that minimize communication can run much faster (and use less energy) than algorithms that do not. Motivated by this, attainable communication lower bounds were established in [12, 13, 4] for a variety of algorithms inclu...
An Algebraic Approach to Cache Memory Characterization for Block Recursive Algorithms
Multiprocessor systems usually have cache or local memory in the memory hierarchy. Obtaining good performance on these systems requires that a program utilizes the cache efficiently. In this paper, we address the issue of generating efficient cache-based algorithms from tensor product formulas. Tensor product formulas have been used for expressing block recursive algorithms like Strassen's matrix ...
Communication-Optimal Convolutional Neural Nets
Efficiently executing convolutional neural nets (CNNs) is important in many machine-learning tasks. Since the cost of moving a word of data, either between levels of a memory hierarchy or between processors over a network, is much higher than the cost of an arithmetic operation, minimizing data movement is critical to performance optimization. In this paper, we present both new lower bounds on d...
Publication date: 2007